Abstract: A major challenge in sparsity pattern estimation
is that small modes are difficult to detect in the presence of
noise. This problem is alleviated if one can observe samples 
from multiple realizations of the nonzero values for the same 
sparsity pattern. We will refer to this as "diversity". Diversity 
comes at a price, however, since each new realization adds 
new unknown nonzero values, thus increasing uncertainty. In 
this paper, upper and lower bounds on joint sparsity pattern 
estimation are derived. These bounds, which improve upon 
existing results even in the absence of diversity, illustrate key 
tradeoffs between the number of measurements, the accuracy 
of estimation, and the diversity. It is shown, for instance, that 
diversity introduces a tradeoff between the uncertainty in the 
noise and the uncertainty in the nonzero values. Moreover, it is 
shown that the optimal amount of diversity significantly improves 
the behavior of the estimation problem for both optimal and 
computationally efficient estimators. 

I. Introduction 

An extensive amount of recent research in signal processing 
and statistics has focused on multivariate regression problems 
with sparsity constraints. One problem of particular interest, 
known as sparsity pattern estimation, is to determine which 
coefficients are nonzero using a limited number of observa- 
tions. Remarkably, it has been shown that accurate estimation 
is possible using a relatively small number of (possibly noisy) 
linear measurements, provided that the number of nonzero 
values is relatively small (see e.g. [1]-[3]).

It has also been shown that the presence of additional struc- 
ture, beyond sparsity, can significantly alter the problem. Var- 
ious examples include distributed or model-based compressed 
sensing [4]-[6], estimation from multiple measurement vectors
[7], simultaneous sparse approximation [8], model selection
[9], union support recovery [10], multi-task learning [11], and
estimation of block-sparse signals [12], [13].

In the present paper, we consider a joint sparsity pattern 
estimation framework motivated in part by the following 
engineering problem. Suppose that one wishes to estimate 
the sparsity pattern of an unknown vector and is allowed either to
take M noisy linear measurements of the vector itself
or to spread the same number of measurements among multiple
vectors with the same sparsity pattern as the original vector, but
different nonzero values. This type of problem arises, for 
example, in magnetic resonance imaging where the vectors 
correspond to images of the same body part (common sparsity 
pattern) viewed with different contrasting agents (different 
nonzero values). 

† Also with the School of Computer and Communication Sciences, EPFL,
Lausanne, Switzerland. 
Fig. 1. Illustration of joint sparsity pattern estimation. The vectors $X_j$ share a
common sparsity pattern $S$ but have independent nonzero values. The sparsity
pattern $S$ is estimated jointly using measurement vectors $Y_j$ corresponding
to different measurement matrices $A_j$.



On one hand, splitting measurements across different vec- 
tors increases the number of unknown values, potentially 
making estimation more difficult. On the other hand, using 
all measurements on a single vector has the risk that nonzero 
values with small magnitudes will not be detected. To under- 
stand this tradeoff, this paper bounds the accuracy of various 
estimators for the estimation problem illustrated in Figure 1.
We refer to the number of vectors J as the "diversity". 

A. Overview of Contributions 

Several key contributions of this paper are the following: 

• Our analysis improves upon previous work in the single-
vector setting [2], [3] and shows that there exists a sharp
divide between knowing almost everything and knowing 
almost nothing about the sparsity pattern, even in problem 
regimes where exact recovery is impossible. 

• Our bounds are relatively tight for a large range of prob- 
lem parameters. Unlike bounds based on the restricted 
isometry property, they apply even when the number of 
measurements is small relative to the size of the sparsity 
pattern. 

• We show that the right amount of diversity is beneficial, 
but too much or too little can be detrimental (when the 
total number of measurements is fixed). Moreover, we 
show that diversity can significantly reduce the gap in 
performance between computationally efficient estimators,
such as the matched filter or LASSO, and estimators
without any computational constraints. 

The remainder of the paper is outlined as follows. Theo- 
rem 1 gives a sufficient condition for a combinatorial estimator. 
Theorem 2 gives an information-theoretic necessary condition 
for any estimator. Theorem 3 gives a necessary and sufficient 
condition for a two-stage estimation architecture correspond- 
ing to either the matched filter, the LASSO, or the MMSE 
vector estimators, and Theorems 4-6 characterize various
tradeoffs between the diversity, the number of measurements,
the SNR, and the accuracy of estimation. 

Finally, we note that the joint estimation problem in this 
paper is closely related to the multiple measurement vector 
problem [7], except that each vector is measured using a
different matrix. Alternatively, our problem is a special case
of block-sparsity [12], [13] with a block-sparse measurement
matrix. Versions of our bounds for block-sparsity with dense 
measurement matrices can also be derived. 

B. Problem Formulation 

Let $X_1, X_2, \dots, X_J \in \mathbb{R}^n$ be a set of jointly random
sparse vectors whose nonzero values are indexed by a common
sparsity pattern $S$:

$$S = \{i : X_j(i) \neq 0\}, \quad \text{for } j = 1, 2, \dots, J. \tag{1}$$

We assume that $S$ is distributed uniformly over all subsets of
$\{1, 2, \dots, n\}$ of size $k$, where $k$ is known, and that the nonzero
values are i.i.d. $\mathcal{N}(0, 1)$.¹

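We consider estimation of $S$ from measurement vectors
$Y_1, Y_2, \dots, Y_J \in \mathbb{R}^m$ of the form

$$\cdots \quad \text{for } j = 1, 2, \dots, J \tag{2}$$

where each $A_j \in \mathbb{R}^{m \times n}$ is a known matrix whose elements
are i.i.d. $\mathcal{N}(0, 1)$ and $W_j \sim \mathcal{N}(0, I_{m \times m})$ is unknown noise.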
The estimation problem is depicted in Figure 1. The accuracy
of an estimate $\hat{S}$ is assessed using the (normalized) distortion
function

$$d(S, \hat{S}) = \frac{1}{k} \max\left(|S \setminus \hat{S}|,\, |\hat{S} \setminus S|\right) \tag{3}$$

where $|S \setminus \hat{S}|$ and $|\hat{S} \setminus S|$ denote the number of missed detections
and false alarms, respectively.
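
As a small illustration of the distortion function, the following sketch counts missed detections and false alarms for two index sets. It assumes the normalization by $k$; the function name `distortion` is ours and does not appear in the original text.

```python
def distortion(true_support, estimate, k):
    """Normalized distortion: max(#missed detections, #false alarms) / k."""
    true_support, estimate = set(true_support), set(estimate)
    missed = len(true_support - estimate)        # indices in S but not in S_hat
    false_alarms = len(estimate - true_support)  # indices in S_hat but not in S
    return max(missed, false_alarms) / k


S = {1, 4, 7, 9}
S_hat = {1, 4, 8, 9}  # one missed detection (7) and one false alarm (8)
print(distortion(S, S_hat, k=len(S)))
```

Because the maximum of the two error counts is used, an estimate of size $k$ with one wrong index has distortion $1/k$ rather than $2/k$.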

Our analysis considers the high-dimensional setting where
the diversity $J$ is fixed but the vector length $n$, sparsity $k$,
and number of measurements per vector $m$ tend to infinity.
We focus exclusively on the setting of linear sparsity where
$k/n \to \kappa$ for some fixed sparsity rate $\kappa \in (0, 1/2)$ and
$m/n \to r$ for some fixed per-vector sampling rate $r > 0$.
The total number of measurements is given by $M = mJ$,
and we use $\rho = Jr$ to denote the total sampling rate. We
say that a distortion $\alpha \ge 0$ is achievable for an estimator $\hat{S}$ if
$\Pr[d(S, \hat{S}) > \alpha] \to 0$ as $n \to \infty$. The case $\alpha = 0$ corresponds
to exact recovery and the case $\alpha > 0$ corresponds to a constant
fraction of errors.
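
The setup above can be simulated for small sizes. The sketch below draws a common support, $J$ independent sets of nonzero values, Gaussian matrices, and unit-variance noise. The signal gain `sqrt(snr / k)` is a hypothetical choice made only so that the example runs; it is an assumption and should not be read as the measurement model of the paper. The function name `draw_problem` is likewise ours.

```python
import math
import random


def draw_problem(n, k, m, J, snr, seed=0):
    """Draw (support, X, A, Y) for J vectors sharing one sparsity pattern."""
    rng = random.Random(seed)
    support = sorted(rng.sample(range(n), k))  # uniform over size-k subsets
    gain = math.sqrt(snr / k)  # ASSUMED scaling, not taken from the paper
    X, A, Y = [], [], []
    for _ in range(J):
        x = [0.0] * n
        for i in support:
            x[i] = rng.gauss(0.0, 1.0)  # i.i.d. N(0, 1) nonzero values
        Aj = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
        # Each measurement is a noisy inner product with one row of A_j.
        y = [gain * sum(row[i] * x[i] for i in support) + rng.gauss(0.0, 1.0)
             for row in Aj]
        X.append(x)
        A.append(Aj)
        Y.append(y)
    return support, X, A, Y


support, X, A, Y = draw_problem(n=20, k=3, m=8, J=4, snr=100.0)
print(support, "total measurements M =", len(Y) * len(Y[0]))
```

Here the total budget is $M = mJ = 32$; choosing $J = 1$ with $m = 32$ would spend the same budget on a single vector.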

C. Notations 

For a matrix $A$ and set of integers $S$ we use $A(S)$ to denote
the matrix formed by concatenating the columns of $A$ indexed
by $S$. We use $H_b(p) = -p \log p - (1 - p) \log(1 - p)$ to denote
binary entropy and all logarithms are natural.

¹ The results in this paper extend to any i.i.d. distribution with bounded
second moment. Due to space constraints, only the Gaussian case is presented.



II. Joint Estimation Bounds 

This section gives necessary and sufficient conditions for the 
joint sparsity pattern estimation problem depicted in Figure 1.

One important property of the estimation problem is the 
relative size of the smallest nonzero values, averaged across 
realizations. For a given fraction $\beta \in [0, 1]$, we define the random
variable

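$$P_J^{(n)}(\beta) = \min_{\Delta \subseteq S \,:\, |\Delta| = \beta k} \frac{1}{J |\Delta|} \sum_{j=1}^{J} \|X_j(\Delta)\|^2. \tag{4}$$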

By the Glivenko-Cantelli theorem, $P_J^{(n)}(\beta)$ converges almost
surely to a nonrandom limit $P_J(\beta)$. We will refer to this limit
as the diversity power. If the nonzero values are Gaussian, as
is assumed in this paper, it can be shown that

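$$P_J(\beta) = \frac{1}{\beta} \int_0^{\beta} \xi_J(p)\, dp \tag{5}$$

where

$$\xi_J(p) = \left\{ t : \Pr\left[\tfrac{1}{J} \chi_J^2 \le t\right] = p \right\} \tag{6}$$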

denotes the quantile function of a normalized chi-square 
random variable with J degrees of freedom. 
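
The effect of $J$ on the diversity power can be checked empirically. The sketch below assumes that the diversity power is the average, over the $\beta k$ weakest indices of $S$, of the per-index power $\frac{1}{J}\sum_j X_j(i)^2$; this normalization is our reading of the definition and is an assumption. The function name `empirical_diversity_power` is ours.

```python
import random


def empirical_diversity_power(k, J, beta, seed=0):
    """Average per-index power over the beta*k weakest indices of the support."""
    rng = random.Random(seed)
    # Per-index power averaged over the J realizations: (1/J) * sum_j X_j(i)^2.
    powers = sorted(
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(J)) / J for _ in range(k)
    )
    size = max(1, round(beta * k))
    # The minimizing subset consists of the `size` smallest per-index powers.
    return sum(powers[:size]) / size


for J in (1, 2, 4, 8):
    print(J, empirical_diversity_power(k=5000, J=J, beta=0.05))
```

Under this assumption, averaging over more realizations makes it less likely that any index is weak in all of them, so the power of the weakest 5% of indices grows with $J$, while at $\beta = 1$ the value stays near 1 for every $J$.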

Another important property is the metric entropy rate (in
nats per vector length) of $S$ with respect to our distortion
function $d(S, \hat{S})$. In [2], it is shown that this rate is given by

$$R(\kappa, \alpha) = H_b(\kappa) - \kappa H_b(\alpha) - (1 - \kappa) H_b\!\left(\frac{\kappa \alpha}{1 - \kappa}\right) \tag{7}$$

for all $\alpha < 1 - \kappa$ and is equal to zero otherwise.
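
For numerical work, the binary entropy and the metric entropy rate can be evaluated directly. The sketch below assumes that the argument of the last entropy term is $\kappa\alpha/(1-\kappa)$, which makes the rate vanish at $\alpha = 1 - \kappa$ as stated; the function names are ours.

```python
import math


def binary_entropy(p):
    """Binary entropy in nats, with the convention 0 * log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def metric_entropy_rate(kappa, alpha):
    """R(kappa, alpha) for sparsity rate kappa in (0, 1/2) and distortion alpha."""
    if alpha >= 1.0 - kappa:
        return 0.0
    return (binary_entropy(kappa)
            - kappa * binary_entropy(alpha)
            - (1.0 - kappa) * binary_entropy(kappa * alpha / (1.0 - kappa)))


kappa = 0.1
for alpha in (0.0, 0.1, 0.5, 0.9):
    print(alpha, metric_entropy_rate(kappa, alpha))
```

At $\alpha = 0$ the rate equals $H_b(\kappa)$, the full entropy rate of the sparsity pattern, and it decreases to zero as the allowed distortion approaches $1 - \kappa$.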
A. Nearest Subspace Upper Bound 

We first consider the nearest subspace (NS) estimator which
is given by

$$\hat{S}^{\mathrm{NS}} = \arg\min_{S \,:\, |S| = k} \sum_{j=1}^{J} \mathrm{dist}(Y_j, A_j(S))^2 \tag{8}$$

where $\mathrm{dist}(Y_j, A_j(S))$ denotes the Euclidean distance between
$Y_j$ and the linear subspace spanned by the columns of
$A_j(S)$. (For the case $J = 1$, this estimator is known variously
throughout the literature as $\ell_0$ minimization or maximum
likelihood estimation.)
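
To make the combinatorial nature of the nearest subspace estimator concrete, the following sketch implements it by exhaustive search in pure Python. It is feasible only for tiny $n$, since it visits every size-$k$ subset. The helper names, the problem sizes, and the noiseless demo data are all hypothetical choices of ours.

```python
import itertools
import math
import random


def residual_sq(y, columns):
    """Squared Euclidean distance from y to the span of the given columns."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:  # remove components along earlier basis vectors
            c = sum(a * b for a, b in zip(v, q))
            v = [a - c * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip numerically dependent columns
            basis.append([a / norm for a in v])
    r = list(y)
    for q in basis:  # project y onto the orthogonal complement of the span
        c = sum(a * b for a, b in zip(r, q))
        r = [a - c * b for a, b in zip(r, q)]
    return sum(a * a for a in r)


def nearest_subspace(Y, A, k):
    """Exhaustive search over all size-k supports, summing residuals over j."""
    n = len(A[0][0])
    # cols[j][i] is column i of the matrix A_j (each A_j is a list of rows).
    cols = [[[row[i] for row in Aj] for i in range(n)] for Aj in A]
    best, best_cost = None, float("inf")
    for S in itertools.combinations(range(n), k):
        cost = sum(residual_sq(Y[j], [cols[j][i] for i in S])
                   for j in range(len(Y)))
        if cost < best_cost:
            best, best_cost = set(S), cost
    return best


# Tiny noiseless demo with hypothetical sizes: n = 8, k = 2, m = 4, J = 2.
rng = random.Random(1)
n, k, m, J = 8, 2, 4, 2
true_support = set(rng.sample(range(n), k))
A, Y = [], []
for _ in range(J):
    Aj = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    x = {i: rng.gauss(0.0, 1.0) for i in true_support}
    A.append(Aj)
    Y.append([sum(row[i] * x[i] for i in true_support) for row in Aj])
estimate = nearest_subspace(Y, A, k)
print(estimate == true_support)
```

The cost of a candidate support is the sum of the per-vector residuals, which is how the $J$ measurement vectors are combined into a single decision.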

Theorem 1. For a given set $(\kappa, \mathrm{SNR}, \rho, J)$, a distortion $\alpha$ is
achievable for the nearest subspace estimator if

$$\rho > \kappa J + \max_{\beta \in [\alpha, 1]} \min\{E_1(\beta), E_2(\beta)\} \tag{9}$$

where 



^^(^) ^ 2H,{k) -2R{k,P) + 2pKj\og{b/i) ^^^^ 



E2{I3) = 



ilog(l + 4jPj(/3)SNR) 

2iJb(«)-2i?(K,/3) 



(11) 



log(l + Pi(/?)SNR) + l/(Pi(/3)SNR) -1 
with $P_J(\cdot)$ given by (5) and $R(\cdot, \cdot)$ given by (7).

Theorem 1 is a combination of two bounds. The part due to
$E_1(\beta)$ determines the scaling behavior at low distortions and
low SNR and the part due to $E_2(\beta)$ determines the scaling
behavior at high SNR. One important property of $E_1(\beta)$ is
that its denominator scales linearly with the effective power
$P_J(\beta)\,\mathrm{SNR}$ when $\beta$ is small. As a consequence,
Theorem 1 closes a gap in previous bounds for the case $J = 1$
and correctly characterizes the boost in performance due to the
diversity when $J > 1$.



B. Optimal Estimation 

We next consider an information-theoretic lower bound on 
the distortion for any estimator. This bound depends on the 
entropy of the smallest nonzero values. For a given fraction 
$\beta \in [0, 1]$, we define the conditional entropy power

^f{|3)^^^exp{-2h{U\U'<^^{(3))} (12) 

where $h(\cdot)$ is differential entropy and $U \sim \mathcal{N}(0, 1)$.

Theorem 2. For a given set $(\kappa, \mathrm{SNR}, \rho, J)$, a distortion $\alpha$ is
not achievable for any estimator if



max 

/3e[o 

where 



\. {^(l^fe' f ) - J^^ (Ai(/3), A2(/3)) } > (13) 



Ai(/3)=Vi(t37^,^J(/3)SNR) (14) 

A2(/3) = Vi(t3^,/31-V^Pi(/3V./)snr) 

-T3!t^V2(^,/3A/-(/3V.0snr) (15) 



with 



Vi(r,7)= ? 



_.§log(l + 7), ifr<l 



log(l + rj), if r > 1 



(16) 



[^log(l + r7A(i)), ifr>l 

and A{r) = e-^{l - ry-^/^ 

Theorem 2 is also a combination of two bounds. The part
due to $A_1(\beta)$ determines the scaling behavior at low distortions
and low SNR and the part due to $A_2(\beta)$ determines the
scaling behavior at high SNR. As was the case for the nearest
subspace upper bound, this bound is inversely proportional to
the effective power $P_J(\beta)\,\mathrm{SNR}$ when the effective power is
small.

III. Two-Stage Estimation Bounds 

This section gives bounds for the two-stage estimation
architecture depicted in Figure 2. In the first stage, each
vector $X_j$ is estimated from its measurements $Y_j$. In the
second stage, the sparsity pattern $S$ is estimated by jointly
thresholding the estimates $\hat{X}_1, \hat{X}_2, \dots, \hat{X}_J$. One advantage of
this architecture is that the estimation in the first stage can be
done in parallel. We will see that this architecture can be near
optimal in some settings but is highly suboptimal in others.

A. Single-Vector Estimation 

Three different estimators are considered: the matched filter 
(MF), the LASSO, and the minimum mean squared error esti- 
mator (MMSE). Recent results have shown that the asymptotic 
behavior of these estimators can be characterized in terms of 
an equivalent scalar estimation problem. Since these results
correspond to the case $J = 1$, we use the notation $X$ and
$Y$ and use the per-vector sampling rate $r$ instead of the
total sampling rate $\rho$. Also, we define the sparse Gaussian
distribution

$$F_\kappa(x) = \kappa \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du + (1 - \kappa)\, \mathbf{1}(x \ge 0) \tag{18}$$

which corresponds to the marginal distribution of $X(i)$.

Fig. 2. Illustration of single-vector estimation followed by joint thresholding.

The first result characterizes the asymptotic behavior of the 
matched filter which is given by 



= i./JLA^Y. 



(19) 



To our knowledge, this result was first shown (with convergence
in probability) in Q, [14]. Almost sure convergence
follows from recent tools developed in [15].

Proposition 1 (Matched Filter). The empirical distribution on
the elements of $(X, \hat{X}^{\mathrm{MF}})$ converges weakly and almost surely
to the distribution on $(X, X + \sigma W)$ where $X \sim F_\kappa$ and
$W \sim \mathcal{N}(0, 1)$ are independent and

1 



SNR 



+ 1 



(20) 



The next result, due to Donoho et al. [16] and Bayati
and Montanari [15], describes the asymptotic behavior of the
LASSO which is given by



XLASSO ^ ^^g ij^f 1 ||Y _ . /^Ax||2 

where $\lambda > 0$ is a regularization parameter.



A!|x|h 



(21) 



Proposition 2 (LASSO). The empirical distribution on the
elements of $(X, \hat{X}^{\mathrm{LASSO}})$ converges weakly almost surely to
the distribution on $(X, \eta_t(X + \sigma W))$ where $X \sim F_\kappa$ and
$W \sim \mathcal{N}(0, 1)$ are independent,
$\eta_t(x) = [x - \mathrm{sign}(x)\,t]\,\mathbf{1}(|x| > t)$, and $\sigma^2$ and $t$ are given by the fixed point equations



a = — 
r 

1 

t = - 
r 



SNR 



SNR '-' ' ^ 



(22) 
(23) 
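
The scalar equivalent model in Proposition 2 is easy to sample once $\sigma$ and $t$ are known. The sketch below implements the soft-thresholding function $\eta_t$ and draws one pair $(X, \eta_t(X + \sigma W))$. It takes $\sigma$ and $t$ as inputs rather than solving the fixed point equations, and the function names are ours.

```python
import math
import random


def soft_threshold(x, t):
    """eta_t(x) = [x - sign(x) * t] if |x| > t, and 0 otherwise."""
    return x - math.copysign(t, x) if abs(x) > t else 0.0


def scalar_lasso_channel(kappa, sigma, t, rng):
    """One draw of (X, eta_t(X + sigma * W)) with X sparse Gaussian, W ~ N(0, 1)."""
    # With probability kappa the entry is N(0, 1); otherwise it is exactly zero.
    x = rng.gauss(0.0, 1.0) if rng.random() < kappa else 0.0
    return x, soft_threshold(x + sigma * rng.gauss(0.0, 1.0), t)


rng = random.Random(0)
pairs = [scalar_lasso_channel(kappa=0.1, sigma=0.3, t=0.5, rng=rng)
         for _ in range(5)]
print(pairs)
```

Values of the noisy observation within $t$ of zero are mapped to exactly zero, and larger values are shrunk toward zero by $t$.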



The final result, based on the work of Guo and Verdú [17],
characterizes the asymptotic behavior of the MMSE which is
given by

$$\hat{X}^{\mathrm{MMSE}} = \mathbb{E}[X \mid Y]. \tag{24}$$

This result depends on a powerful but non-rigorous replica 
method, and is thus stated as a claim. 

Claim 1 (MMSE). The distribution on the elements of
$(X, \hat{X}^{\mathrm{MMSE}})$ converges weakly in expectation to the distribution
on $(X, \mathbb{E}[X \mid X + \sigma W])$ where $X \sim F_\kappa$ and $W \sim \mathcal{N}(0, 1)$
are independent and $\sigma^2$ is given by

(j^ = argmin jrlogcr^H ^ + 2/(X; X + crVF)). 

^-2^r,L <= SNRcr^ ^ '^ 

(25) 
Fig. 4. Bounds on the distortion $\alpha$ as a function of the total sampling rate $\rho = Jr$ for various $J$ when SNR = 40 dB and $\kappa = 10^{-4}$.



B. Thresholding 

For the second stage of estimation we consider the joint 
thresholding sparsity pattern estimator given by 



^™ = {-EL^|«>0 



(26) 



where the threshold $t \ge 0$ is chosen to minimize the expected
distortion. Since this estimator evaluates each index
$i \in \{1, 2, \dots, n\}$ independently, and since the estimated vectors
$\hat{X}_1, \hat{X}_2, \dots, \hat{X}_J$ are conditionally independent given the
sparsity pattern $S$, the distribution on the distortion $d(S, \hat{S}^{\mathrm{TH}})$
can be characterized by the joint distribution on $(X_1(1), \hat{X}_1(1))$.
The following result describes the relationship between the
distortion $\alpha$, the diversity $J$, and the effective noise power $\sigma^2$.
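
A joint thresholding rule of this kind can be sketched in a few lines. The version below assumes that the per-index statistic is the sum over $j$ of the squared estimates and that an index is kept when this statistic exceeds $t$; both the statistic and the function name `joint_threshold` are assumptions made for illustration.

```python
def joint_threshold(estimates, t):
    """Keep index i when the energy summed over the J estimates exceeds t."""
    n = len(estimates[0])
    return {i for i in range(n) if sum(x[i] ** 2 for x in estimates) > t}


# Two hypothetical first-stage estimates (J = 2) of vectors of length n = 4.
estimates = [
    [0.9, 0.1, -1.2, 0.0],
    [-1.1, 0.2, 0.8, 0.1],
]
print(joint_threshold(estimates, t=1.0))
```

Each index is tested on its own, which is why the behavior of the rule depends only on the joint distribution of a single true entry and its estimates.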

Theorem 3. Suppose that for $j = 1, 2, \dots, J$, the empirical
joint distributions on the elements of $(X_j, \hat{X}_j)$ converge
weakly to one of the scalar distributions corresponding to the
matched filter (Proposition 1), the LASSO (Proposition 2), or
the MMSE (Claim 1) with noise power $\sigma^2$. Then, a distortion
$\alpha$ is achievable for the thresholding estimator (26) if and only
if a^ > crj(a) where 

'^^^"^ = 0(l-f^)-0(a) ^''^ 

with $\xi_J(\alpha)$ given by (6).

Theorem 3 shows that the relationship between $\alpha$ and $J$ is
encapsulated by the term $\sigma_J(\alpha)$. With a bit of work it can
be shown that the numerator and denominator in (27) scale
like a^'^Pj{a) and a^^R{K, a) respectively when a is small. 
Thus, plugging $\sigma_J(\alpha)$ into the equivalent noise expression of
the matched filter given in (20) shows that bounds attained
using Theorem 3 have similar low distortion behavior to the
bounds in Section II.



One advantageous property of Theorem 3 is that the bounds
are exact. As a consequence, these bounds are sometimes
lower than the upper bound in Theorem 1, which is loose
in general. One shortcoming, however, is that the two-stage
architecture does not take full advantage of the joint structure
during the first stage of estimation. As a consequence, the
performance of these estimators can be highly suboptimal,
especially at high SNR.

IV. Sampling-Diversity Tradeoff

In this section, we analyze various behaviors of the bounds 
in Theorems 1, 2, and 3, with an emphasis on the tradeoff pro- 
vided by the diversity J. The following results characterize the 
high SNR and low distortion behavior of optimal estimation. 

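Theorem 4 (High SNR). Let $(\kappa, J, \alpha)$ be fixed and let $\rho(\mathrm{SNR})$
denote the infimum over sampling rates $\rho$ such that $\alpha$ is
achievable for the optimal estimator. Fix any $\epsilon > 0$.
(a) If $\alpha > 0$, then

$$\rho(\mathrm{SNR}) < J\kappa + \frac{2 H_b(\kappa)(1 + \epsilon)}{\log \mathrm{SNR}} \tag{28}$$

for all SNR large enough.
(b) If $2R(\kappa, \alpha) > J\kappa$, then

$$\rho(\mathrm{SNR}) > J\kappa + \frac{2 R(\kappa, \alpha)(1 - \epsilon)}{\log \mathrm{SNR}} \tag{29}$$

for all SNR large enough.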

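Theorem 5 (Low Distortion). Let $(\kappa, J, \mathrm{SNR})$ be fixed and let
$\rho(\alpha)$ denote the infimum over sampling rates $\rho$ such that $\alpha$
is achievable for the optimal estimator. There exist constants
$0 < C^- \le C^+ < \infty$ such that

$$C^- \left(\frac{1}{\alpha}\right)^{2/J} \log\left(\frac{1}{\alpha}\right) \le \rho(\alpha) \le C^+ \left(\frac{1}{\alpha}\right)^{2/J} \log\left(\frac{1}{\alpha}\right) \tag{30}$$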
for all $\alpha$ small enough.

Fig. 5. The upper bound (Theorem 1) on the total sampling rate $\rho = Jr$ of
the nearest subspace estimator as a function of the distortion $\alpha$ for various $J$
when SNR = 40 dB and $\kappa = 10^{-4}$.

Theorems 4 and 5 illustrate a tradeoff. At high SNR, the
difficulty of estimation is dominated by the uncertainty about
the nonzero values. Accordingly, the number of measurements
is minimized by letting $J = 1$. As the desired distortion
becomes small, however, the opposite behavior occurs. Since
estimation is limited by the size of the smallest nonzero values,
it is optimal to choose $J$ large to increase the diversity power.
This behavior can be seen, for example, in Figures 3-6.
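
The low-distortion side of this tradeoff can be made tangible by evaluating the $\alpha$-dependent factor $(1/\alpha)^{2/J} \log(1/\alpha)$ from Theorem 5 for several values of $J$. This is only a heuristic comparison: the constants $C^-$ and $C^+$ depend on $J$ and are dropped here, and the function name is ours.

```python
import math


def low_distortion_scaling(alpha, J):
    """(1/alpha)^(2/J) * log(1/alpha), with the J-dependent constants dropped."""
    return (1.0 / alpha) ** (2.0 / J) * math.log(1.0 / alpha)


alpha = 1e-4
for J in (1, 2, 4, 8):
    print(J, low_distortion_scaling(alpha, J))
```

The factor shrinks by orders of magnitude as $J$ grows, which reflects the polynomial exponent $2/J$ in the bound.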

A natural question, then, is how best to choose the
diversity $J$. The following result shows that the right amount
of diversity can significantly improve performance.

Theorem 6. Let $(\kappa, \mathrm{SNR})$ be fixed and let $\rho(\alpha, J)$ denote the
infimum over sampling rates $\rho$ such that $\alpha$ is achievable with
diversity $J$. Then,

p{a,J)<Kj + 0{p^). (31) 

Moreover, if $J = J^*(\alpha) = \Theta(\log(1/\alpha))$ then

$$\rho(\alpha, J^*(\alpha)) = \Theta(\log(1/\alpha)). \tag{32}$$

An important implication of Theorem 6 is that the optimal
choice of $J$ allows the distortion to decay exponentially rapidly
with the sampling rate $\rho$. Note that the rate of decay is only
polynomial if $J$ is fixed. Interestingly, it can also be shown that
the same exponential boost can be obtained using non-optimal 
estimators, albeit with smaller constants in the exponent. 
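
One informal way to see why a diversity of order $\log(1/\alpha)$ removes the polynomial factor is to substitute $J = \log(1/\alpha)$ into the expression $(1/\alpha)^{2/J} \log(1/\alpha)$ from Theorem 5, which gives $e^2 \log(1/\alpha)$. The sketch below checks this numerically. It is a heuristic only: Theorem 5 is stated for fixed $J$ with constants that may depend on $J$, $J$ is treated as real-valued, and the function name is ours.

```python
import math


def rate_with_matched_diversity(alpha):
    """Evaluate (1/alpha)^(2/J) * log(1/alpha) at J = log(1/alpha)."""
    J = math.log(1.0 / alpha)  # real-valued for illustration; requires alpha < 1
    return (1.0 / alpha) ** (2.0 / J) * math.log(1.0 / alpha)


for alpha in (1e-2, 1e-4, 1e-8):
    ratio = rate_with_matched_diversity(alpha) / math.log(1.0 / alpha)
    print(alpha, ratio, math.exp(2.0))
```

The ratio to $\log(1/\alpha)$ stays at the constant $e^2$, so in this heuristic the rate grows only logarithmically in $1/\alpha$.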

The effect of the diversity $J$ is illustrated in Fig. 5 for the
nearest subspace estimator and in Fig. 6 for LASSO + thresholding.
In both cases, the bounds show the same qualitative
behavior: each value of the diversity $J$ traces out a different
curve in the sampling rate-distortion region. It is important to
note, however, that due to the suboptimality of the two-stage
architecture and the LASSO estimator, these similar behaviors
occur only at different SNRs and with an order of magnitude
difference in the sampling rate.
Fig. 6. The upper bound (Theorem 3) on the total sampling rate $\rho = Jr$ of
LASSO + joint thresholding as a function of the distortion $\alpha$ for various $J$
when SNR = 30 dB and $\kappa = 10^{-4}$.